It is a central problem in both statistics and computer science to understand the theoretical foundation of machine learning, especially deep learning. During the past decade, deep learning has achieved remarkable successes in solving many complex artificial intelligence tasks. The aim of this dissertation is to understand deep neural networks (DNNs) and other nonparametric methods in machine learning. In particular, three machine learning models have been studied: weight normalized DNNs, sparse DNNs, and the compositional nonparametric model.

The first chapter presents a general framework for norm-based capacity control for *L*_{p,q} weight normalized DNNs. We establish an upper bound on the Rademacher complexity of this family. In particular, with an *L*_{1,∞} normalization, we discuss properties of a width-independent capacity control, which depends on the depth only through a square-root term. Furthermore, if the activation functions are anti-symmetric, the bound on the Rademacher complexity is independent of both the width and the depth up to a log factor. In addition, we study weight normalized deep neural networks with rectified linear units (ReLU) in terms of functional characterization and approximation properties. In particular, for an *L*_{1,∞} weight normalized network with ReLU, the approximation error can be controlled by the *L*_{1} norm of the output layer.

In the second chapter, we study *L*_{1,∞}-weight normalization for deep neural networks with bias neurons as a way to obtain a sparse architecture. We establish generalization error bounds for both regression and classification under the *L*_{1,∞}-weight normalization. The upper bounds are shown to be independent of the network width and to have a *k*^{1/2} dependence on the network depth *k*. These results provide theoretical justification for using such weight normalization to reduce the generalization error. We also develop an easily implemented gradient projection descent algorithm to obtain a sparse neural network in practice. We perform various experiments to validate our theory and demonstrate the effectiveness of the resulting approach.
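As an illustration of how a projection step can produce sparsity, the following is a minimal sketch of projected gradient descent under an *L*_{1,∞}-type constraint. It is not the dissertation's algorithm: it assumes the constraint bounds the *L*_{1} norm of each neuron's incoming weight vector (bias included) by a radius `c`, and it uses the standard sort-based Euclidean projection onto the *L*_{1} ball. The function names and the list-of-lists weight layout are hypothetical.

```python
import math


def project_l1_ball(v, c):
    """Euclidean projection of v onto {x : ||x||_1 <= c}, with c > 0.

    Standard sort-based method: find a soft-threshold theta so that the
    shrunken vector has L1 norm exactly c.
    """
    if sum(abs(x) for x in v) <= c:
        return list(v)  # already feasible, nothing to do
    magnitudes = sorted((abs(x) for x in v), reverse=True)
    running_sum = 0.0
    theta = 0.0
    for i, u in enumerate(magnitudes, start=1):
        running_sum += u
        candidate = (running_sum - c) / i
        if u > candidate:
            theta = candidate  # keep the last threshold that is still valid
    # Soft-thresholding: small entries become exactly zero (sparsity).
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


def project_l1_inf(neurons, c):
    """Project each neuron's weight vector onto the L1 ball of radius c.

    Assumption: `neurons` is a list of per-neuron incoming weight vectors.
    """
    return [project_l1_ball(w, c) for w in neurons]


def projected_gradient_step(neurons, grads, lr, c):
    """One gradient step on every weight, followed by the projection."""
    stepped = [
        [w_i - lr * g_i for w_i, g_i in zip(w, g)]
        for w, g in zip(neurons, grads)
    ]
    return project_l1_inf(stepped, c)
```

For example, projecting `[3.0, 1.0, 0.5]` onto the ball of radius 2 shrinks the largest entry and sets the two smaller ones to zero, which is the mechanism by which such a constraint can yield a sparse network.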

In the third chapter, we propose a compositional nonparametric method in which a model is expressed as a labeled binary tree of *2k+1* nodes, where each node is either a summation, a multiplication, or the application of one of the *q* basis functions to one of the *m*_{1} covariates. We show that, to recover a labeled binary tree from a given dataset, a sufficient number of samples is *O*(*k* log(*m*_{1}*q*) + log(*k*!)), and a necessary number of samples is Ω(*k* log(*m*_{1}*q*) − log(*k*!)). We further propose a greedy algorithm for regression and validate our theoretical findings through synthetic experiments.
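To make the tree representation concrete, here is a minimal sketch of such a labeled binary tree and its evaluation. It assumes that the *k* internal nodes carry the summation or multiplication labels and that the *k*+1 leaves each apply one basis function to one covariate; the class names, the `"+"`/`"*"` labels, and the example basis functions are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Union


@dataclass
class Leaf:
    basis: Callable[[float], float]  # one of the q basis functions
    covariate: int                   # index of one of the m_1 covariates


@dataclass
class Op:
    kind: str        # "+" for summation, "*" for multiplication
    left: "Node"
    right: "Node"


Node = Union[Leaf, Op]


def evaluate(node: Node, x: Sequence[float]) -> float:
    """Evaluate the model encoded by the tree at covariate vector x."""
    if isinstance(node, Leaf):
        return node.basis(x[node.covariate])
    left = evaluate(node.left, x)
    right = evaluate(node.right, x)
    if node.kind == "+":
        return left + right
    if node.kind == "*":
        return left * right
    raise ValueError(f"unknown internal label: {node.kind!r}")


def count_nodes(node: Node) -> int:
    """Total number of nodes; a full binary tree with k internal nodes has 2k+1."""
    if isinstance(node, Leaf):
        return 1
    return 1 + count_nodes(node.left) + count_nodes(node.right)


def square(t: float) -> float:
    return t * t


def identity(t: float) -> float:
    return t


# Example with k = 2 internal nodes: sin(x_0) * x_1^2 + x_2
example_tree = Op("+", Op("*", Leaf(math.sin, 0), Leaf(square, 1)), Leaf(identity, 2))
```

Here `count_nodes(example_tree)` is 5, matching *2k+1* with *k* = 2, and `evaluate(example_tree, x)` computes sin(x_0)·x_1² + x_2 for a covariate vector `x`.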